Abstract: A representation of the genetic code as a six-dimensional Boolean
hypercube is proposed. It is assumed here that this structure is the result of the 
hierarchical order of the interaction energies of the bases in codon-anticodon 
recognition. The proposed structure demonstrates that in the genetic code 
there is a balance between conservatism and innovation. Comparing aligned 
positions in homologous protein sequences two different behaviors are found: 
a) There are sites in which the different amino acids present may be explained 
by one or two "attractor nodes" (coding for the dominating amino acid(s)) 
and their one-bit neighbors in the codon hypercube, and b) There are sites in 
which the amino acids present correspond to codons located in closed paths 
in the hypercube. The structure of the code facilitates evolution: the variation
found at the variable positions of proteins does not correspond to random jumps
at the codon level, but to well-defined regions of the hypercube.
1. Introduction



The genetic code is the biochemical system for gene expression. It deals with
the translation, or decoding, of information contained in the primary structure
of DNA and RNA molecules into protein sequences. Therefore, the genetic code is
both a physico-chemical and a communication system. Physically, molecular
recognition depends on the degree of complementarity between the interacting
molecular surfaces (by means of weak interactions); informationally, a
prerequisite to define a code is the concept of distinguishability. It is the
physical indistinguishability of some codon-anticodon interaction energies that
makes the codons synonymous, and the code degenerate and redundant.

In natural languages as well as in the genetic code, the total redundancy 
is due to a hierarchy of constraints acting one upon another. The specific way 
in which the code departs from randomness is, by definition, its structure. 
It is assumed here that this structure is the result of the hierarchical order 
of the interaction energies of the bases in codon-anticodon recognition. As 
we shall see, it may be represented by a six-dimensional Boolean hypercube
in which the codons (actually the code-words; see below) occupy the vertices
(nodes) in such a way that all kinship neighborhoods are correctly represented.
This approach is a particular application to binary sequences of length six of
the general concept of sequence-space, first introduced in coding theory by
Hamming.

A code-word is next to six nodes representing codons differing in a single
property. Thus the hypercube simultaneously represents the whole set of codons
and keeps track of which codons are one-bit neighbors of each other. Different
hyperplanes correspond to the four stages of the evolution of the code according
to the Co-evolution Theory. Hops within three of the "columns"
(four-dimensional cubes), consisting of the codon classes NGN, NAN, NCN, and
NUN, lead to silent and conservative amino acid substitutions, while hops in the
same hyperplane (four-dimensional subspace belonging to any of the codon classes
ANN, CNN, GNN or UNN) lead to non-conservative substitutions, frequently found
in proteins. The proposed structure demonstrates that in the genetic code there
is a good balance between conservatism and innovation. To illustrate the
results, several examples of the non-conservative variable positions of
homologous proteins are discussed. Two different behaviors are found: a) there
are sites in which the different amino acids present may be explained by one or
two "attractor nodes" (coding for the dominating amino acid(s)) and their
one-bit neighbors in the codon hypercube, and b) there are sites in which the
amino acids present correspond to codons located in closed paths in the
hypercube.



2. Codon-Anticodon Interaction 

In his early paper Eigen recognized that the optimization between stability
and rate, which is always found for enzyme-substrate interactions, also applies
to the codon-anticodon interaction. However, he attributed the codon size to
mechanistic coincidences: "codons with less than three bases would be very
unstable (at least for A and U). Codons with more than three bases, especially
for G and C, become too 'sticky'". This is certainly not a coincidence, but a
requirement for the system to function as an efficient communication device.
Three bases are needed for effectively binding the adapter to the messenger.
Thus, the codon size determines the range of overall codon-anticodon interaction
strength within which recognition can occur. Genetic translation rate is
limited, among other things, by codon-anticodon recognition, which in turn
depends on base-pair lifetimes in a given structural situation. These lifetimes
are influenced by the nature of the pairs: they are shorter for AT than for GC
pairs.

The four bases occurring in DNA (RNA) macromolecules define the corresponding
alphabet X: {A, C, G, T} or {A, C, G, U}. Each base is completely specified by
two independent dichotomic categorizations (Fig. 1):

(i) according to chemical type, C : {R, Y}, where R: (A, G) are purines and
Y: (C, U) are pyrimidines, and (ii) according to H-bonding, H : {W, S}, where
W: (A, U) are weak and S: (C, G) strong bases. The third possible partition
into imino/keto bases is not independent of the former ones.

Denoting by $C_i$ the chemical type and by $H_i$ the H-bond category of the
base $B_i$ at position $i$ of a codon, our basic assumption is that the
codon-anticodon interaction energy obeys the following hierarchical order:

$$C_2 > H_2 > C_1 > H_1 > C_3 > H_3$$

This means that the most important characteristic determining the
codon-anticodon interaction is the chemical type of the base in the second
position.

Footnote: Interestingly enough, this feature of the genetic communication system
has its counterpart in human communication. In a series of experiments on
reading lists of words, performed by J.E. Karlin and J.R. Pierce (Pierce J.R.,
An Introduction to Information Theory, Dover Publications, Inc., N.Y. 1961), in
which the subject "transmits" the information by translating it into a new form,
speech rather than print, by reading the list aloud, they concluded that: "It
seems fairly clear that reading speed is limited by word recognition not by
word utterance" (underlined in the original).
Fig. 1. Categorizations of the bases. The categorizations of the bases according to
(i): chemical type C : {R, Y}, where R: (A, G) are purines and Y: (C, U) are
pyrimidines, and (ii) according to H-bonding, H : {W, S}, where W: (A, U) are
weak and S: (C, G) strong bases. The third possible partition into imino/keto bases
is not independent of the former ones and is irrelevant for the codon-anticodon
interaction. The binary representation of the bases is also shown. The first bit is
the chemical type and the second one the H-bonding character. $\alpha$, $\beta$ and
$\gamma$ are the transformations of the bases, which form a Klein-4 group [6,8].



The next most important characteristic is whether there is a weak or strong 
base in this position, then the chemical type of the first base and so on. 

The bases are represented by the nodes of a 2-cube (Fig. 1). The first
attribute is the chemical character and the second the hydrogen-bond character.
Extending this association to base triplets, each codon is associated in a
unique way with a code-word consisting of six attribute values (see Table 1).
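The association between codons and code-words can be sketched in a few lines of Python. The text fixes only the order of the attributes (chemical type first, H-bond character second); the concrete bit values used below (R = 0, Y = 1, W = 0, S = 1) and the names `BASE_BITS`, `codeword` and `CODEWORDS` are assumptions made for illustration and need not coincide with the labeling of Fig. 1 or Table 1.

```python
from itertools import product

# Assumed bit values (not stated in the text):
# chemical type R=0, Y=1; H-bonding W=0, S=1.
BASE_BITS = {"A": "00", "G": "01", "U": "10", "C": "11"}


def codeword(codon: str) -> str:
    """Six attribute values of a codon, in the order C1 H1 C2 H2 C3 H3."""
    return "".join(BASE_BITS[base] for base in codon.upper().replace("T", "U"))


# Every codon receives a distinct six-bit code-word.
ALL_CODONS = ["".join(triplet) for triplet in product("ACGU", repeat=3)]
CODEWORDS = {codon: codeword(codon) for codon in ALL_CODONS}

if __name__ == "__main__":
    for codon in ("AUG", "UGG", "CAC"):
        print(codon, CODEWORDS[codon])
```

Under any such labeling the map is one-to-one, so the 64 codons occupy the 64 vertices of the six-dimensional hypercube.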

In some of the hypercube directions single-feature codon changes (one-bit
code-word changes) produce synonymous or conservative amino acid substitutions
in the corresponding protein (when the hops occur in three of the 4-cubes
displayed as "columns" in Figs. 2 and 3), while in other directions they lead to
context-dependent replacements which in general conserve only certain physical
properties. However, if these properties are the only relevant ones in the given
context, the substitution has little effect on the protein structure as well.
These low-constraint sites facilitate evolution because they allow the transit
between hypercube columns belonging to amino acids with very different
physico-chemical properties (e.g. hydrophobic and hydrophilic amino acids,
respectively).
Table 1

Gray code representation of the genetic code. In the first and fourth blocks the
6-binary vectors (code-words) are shown. In the second and fifth blocks appear the
corresponding codons. Finally, in the third and sixth blocks appear the amino acids
in single-letter notation. The first two digits correspond to the first base, the
following two to the second base and the last two to the last base, according to the
binary codification of the bases of Fig. 1.



3. Gray Code Structure of the Genetic Code 



An n-dimensional hypercube, denoted by $Q_n$, consists of $2^n$ nodes, each
addressed by a unique n-bit identification number. A link exists between two
nodes of $Q_n$ if and only if their node addresses differ in exactly one bit
position. A link is said to be along dimension i if it connects two nodes whose
addresses differ in the ith bit (where the least significant bit is referred to
as the 0th bit). $Q_6$ is illustrated in Fig. 2. Two nodes in a hypercube are
said to be adjacent if there is a link present between them. The (Hamming)
distance between any two cube nodes is the number of bits differing in their
addresses. The number of hops needed to reach a node from another node equals
the distance between the two nodes. A d-dimensional subcube in $Q_n$ involves
$2^d$ nodes whose addresses belong to a sequence of n symbols {0, 1, *} in which
exactly d of them are the symbol "*" (i.e. the don't-care symbol, whose value
can be 0 or 1).
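These definitions translate directly into code. The following is a minimal sketch on bit strings; the function names `neighbors`, `hamming_distance` and `subcube` are hypothetical and chosen only for this illustration.

```python
from itertools import product


def neighbors(address: str) -> list:
    """All nodes adjacent to `address`: flip exactly one bit."""
    flip = {"0": "1", "1": "0"}
    return [address[:i] + flip[bit] + address[i + 1:]
            for i, bit in enumerate(address)]


def hamming_distance(u: str, v: str) -> int:
    """Number of bit positions in which two addresses differ."""
    if len(u) != len(v):
        raise ValueError("addresses must have the same length")
    return sum(a != b for a, b in zip(u, v))


def subcube(pattern: str) -> list:
    """Expand a pattern over {0, 1, *} into the nodes of the subcube."""
    choices = [("0", "1") if symbol == "*" else (symbol,) for symbol in pattern]
    return ["".join(bits) for bits in product(*choices)]


if __name__ == "__main__":
    print(neighbors("000000"))            # the six neighbors of a node in Q_6
    print(hamming_distance("000000", "101001"))
    print(len(subcube("**01**")))         # a 4-dimensional subcube of Q_6
```

A pattern with d stars expands to $2^d$ addresses, and every node of $Q_6$ has exactly six neighbors, in agreement with the definitions above.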




Fig. 2. The six-dimensional hypercube. Each node is labeled with the corresponding 
amino acid, in the single letter notation, or terminator symbol. For clarity, only some 
of the links are shown. The cluster of amino acids of the first example discussed in 
the text and their links are displayed. 



The idea of proposing a Gray Code representation of the Genetic Code goes
back to Swanson, where this concept is explained in detail.
A great number of different Gray Codes can be associated with the Genetic
Code, depending on the order of importance of the bits in a code-word. In
Table 1 our chosen Gray Code is displayed. It is constructed according to our
main hypothesis



$$C_2 > H_2 > C_1 > H_1 > C_3 > H_3$$



For example, the first two lines of the table differ in the last bit, corresponding
to $H_3$, which is the least significant bit; the second and the third lines differ in
the next least significant bit, i.e. $C_3$, and so forth.
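A Gray code ordering of this kind can be sketched as follows. The sketch assumes the standard reflected binary Gray code, with the bits ranked from most to least significant as $C_2, H_2, C_1, H_1, C_3, H_3$, and an assumed bit labeling of the bases (R = 0, Y = 1, W = 0, S = 1). The resulting rows reproduce the one-bit steps described above, but they need not coincide row by row with Table 1, whose bit labeling is not reproduced here. The names `BITS_TO_BASE` and `gray_table` are hypothetical.

```python
# Assumed labeling: first bit chemical type (R=0, Y=1),
# second bit H-bonding (W=0, S=1).
BITS_TO_BASE = {"00": "A", "01": "G", "10": "U", "11": "C"}


def gray_table() -> list:
    """Return 64 (code-word, codon) rows in reflected Gray code order."""
    rows = []
    for i in range(64):
        gray = format(i ^ (i >> 1), "06b")   # reflected Gray code, MSB first
        c2, h2, c1, h1, c3, h3 = gray        # significance order of the hypothesis
        word = c1 + h1 + c2 + h2 + c3 + h3   # positional order used in code-words
        codon = "".join(BITS_TO_BASE[word[k:k + 2]] for k in (0, 2, 4))
        rows.append((word, codon))
    return rows


if __name__ == "__main__":
    for word, codon in gray_table()[:8]:
        print(word, codon)
```

Consecutive rows differ in exactly one attribute, so walking down the table corresponds to a path along the edges of the hypercube.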
Fig. 3. The hypercube representation of the genetic code. Each node represents a
code-word (6-binary vector) of attribute values. However, for clarity of
interpretation, the nodes are labeled with the corresponding codons (see Table 1 for
the assignment of codons to vectors). The nodes and links mentioned in the second
example discussed in the text are shown.



4. The Structure of Codon Doublets 



This section is more mathematical than the rest of the paper; non-mathematical
readers may therefore skip the details without losing the thread of the
remaining sections.



In a pioneering paper Danckwerts and Neubert discussed the symmetries
of the sixteen $B_1B_2$ codon doublets in terms of the Klein-4 group of base
transformations. Here their result will be recast in the form of a decision-tree
(Fig. 4), and their analysis will be extended to the $B_2B_3$ doublets.



They found the following structure for the set M of B1B2 doublets: 
Fig. 4. Decision-tree of codon categories and redundancy distribution. The leaves are
the sets of four-fold ($M_1$) and less than four-fold ($M_2$) degenerate $B_1B_2$ doublets.

Starting from AC generate the set:

$$M_0 = \{[(1,1) \cup (\alpha,1) \cup (\alpha,\beta) \cup (\alpha,\gamma)]\,\mathrm{AC}\} = \{\mathrm{AC}, \mathrm{CC}, \mathrm{CG}, \mathrm{CU}\}$$
$$M_1 = [(1,1) \cup (\beta,1)]\,M_0$$ and
$$M_2 = (\alpha,\alpha)\,M_1$$

The sets $M_1$ and $M_2$ consist of four-fold and less than four-fold degenerate
doublets, respectively.

The set M can be expressed as:

$$M = [(1,1) \cup (\beta,1)]\,[(1,1) \cup (\alpha,\alpha)]\,M_0$$

where the base exchange operators ($\alpha$, $\beta$, $\gamma$) are defined in Fig. 1.
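The generation of these sets can be checked with a short sketch. The operators are defined in Fig. 1, which is not reproduced here; the permutations used below are inferred from the explicit form of $M_0$ and from the remark that $\alpha$ changes two bits, and should be read as an assumption of this sketch ($\alpha$: A<->C, G<->U; $\beta$: A<->U, C<->G; $\gamma$: A<->G, C<->U). The helper `act` is hypothetical.

```python
# Assumed base exchange operators (inferred from the example M0).
IDENTITY = str.maketrans("", "")
ALPHA = str.maketrans("ACGU", "CAUG")   # A<->C, G<->U (changes both attributes)
BETA = str.maketrans("AUCG", "UAGC")    # A<->U, C<->G (changes chemical type)
GAMMA = str.maketrans("AGCU", "GAUC")   # A<->G, C<->U (changes H-bond character)


def act(op1, op2, doublets):
    """Apply op1 to the first base and op2 to the second base of each doublet."""
    return {d[0].translate(op1) + d[1].translate(op2) for d in doublets}


SEED = {"AC"}
M0 = (act(IDENTITY, IDENTITY, SEED) | act(ALPHA, IDENTITY, SEED)
      | act(ALPHA, BETA, SEED) | act(ALPHA, GAMMA, SEED))
M1 = M0 | act(BETA, IDENTITY, M0)       # four-fold degenerate doublets
M2 = act(ALPHA, ALPHA, M1)              # less than four-fold degenerate doublets

if __name__ == "__main__":
    print(sorted(M0))
    print(sorted(M1))
    print(sorted(M2))
```

With these permutations $M_1$ and $M_2$ are disjoint, together cover all sixteen doublets, and are each mapped onto themselves by $(\beta, 1)$.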

They showed that: "$M_1$ and $M_2$ are invariant by operating with $(\beta, 1)$ on
$B_1$, but no operation on $B_2$ leaves $M_1$ or $M_2$ invariant. Thus $B_2$ carries more
information than $B_1$ and $B_2$ is therefore more important for the stability
of $M_1$ and $M_2$ than $B_1$. A change of $B_1$ with respect to its hydrogen bond
property does not change the resulting amino acids if all doublets of either $M_1$
or $M_2$ are affected.



Reversing supposition and conclusion, $M_1$ and $M_2$ may be defined as those
doublet sets of 8 elements which are invariant under the $(\beta, 1)$-transformation.
Then experience shows that $M_1$ and $M_2$ are fourfold and less than fourfold
degenerate respectively."

Thus, the third base degeneracy of a codon does not depend on the exact base
$B_1$, but only on its H-bond property (weak or strong).
The above results can be simply visualized as a decision-tree (Fig. 4). As can
be seen from this figure, the redundancy of a codon is determined only by the
H-bond character of $B_1$ and $B_2$: SSN codons (with 6 H-bonds in $B_1B_2$) belong
to $M_1$ while WWN codons (with 4 H-bonds in $B_1B_2$) belong to $M_2$. However,
for codons WSN and SWN (with 5 H-bonds in $B_1B_2$) it is not possible to
decide unless one has more information about the second base: WCN and
SUN belong to $M_1$ while WGN and SAN belong to $M_2$. In all cases at most
three attributes are necessary to determine the redundancy of a codon up to
this point. Of course, the non-degenerate codons (AUG for Methionine and
UGG for Tryptophan) require the specification of all six attributes.
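The decision rules just stated can be written as a small function. This is a sketch of the rules as given in the text (SSN in $M_1$, WWN in $M_2$, WCN and SUN in $M_1$, WGN and SAN in $M_2$); the function name `degeneracy_class` is hypothetical.

```python
STRONG = set("CG")        # S: strong bases
PYRIMIDINES = set("CU")   # Y: pyrimidines


def degeneracy_class(b1: str, b2: str) -> str:
    """Classify a B1B2 doublet as 'M1' (four-fold) or 'M2' (less than four-fold).

    Uses at most three attributes: H1, H2 and, for mixed doublets, C2.
    """
    strong1, strong2 = b1 in STRONG, b2 in STRONG
    if strong1 and strong2:            # SSN: 6 H-bonds in B1B2
        return "M1"
    if not strong1 and not strong2:    # WWN: 4 H-bonds in B1B2
        return "M2"
    # WSN or SWN (5 H-bonds): the chemical type of the second base decides.
    return "M1" if b2 in PYRIMIDINES else "M2"


if __name__ == "__main__":
    for doublet in ("GC", "AU", "AC", "AG", "CU", "CA"):
        print(doublet, degeneracy_class(*doublet))
```

For the mixed doublets the rule reduces to the chemical type of the second base: a pyrimidine in the second position gives $M_1$, a purine gives $M_2$.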

From the decision rules obtained from Fig. 4 it is clear that there are branches
where the refinement procedure cannot continue (the branches which end in
$M_1$) because no matter which base occupies the third codon position the
degeneracy cannot be lifted. This imposes a limit on the maximum number
of amino acids which can be incorporated into the code without resorting to
a "frozen accident" hypothesis. Our proposal generalizes the "2-out-of-3"
hypothesis of Lagerkvist, which refers only to codons in the SSN class.

The sixteen $B_1B_2$ doublets can be represented as the vertices of a
four-dimensional hypercube (Fig. 5). As can be seen from this figure, the sets
$M_1$ and $M_2$ are located in compact regions. Notice that this figure differs from
the one introduced by Bertman and Jungck [13], which considered as basic
transformations $\alpha$ and $\beta$ instead of $\beta$ and $\gamma$ as we did. Since
the operator $\alpha$ changes two bits we do not consider it as basic.




Let us now consider the structure of the set $M'$ of $B_2B_3$ doublets. Exactly as
before, define the following sets:



$$M'_0 = \{\mathrm{NC}\}$$
$$M'_1 = [(1,1) \cup (1,\beta)]\,M'_0$$ and
$$M'_2 = (\alpha,\alpha)\,M'_1$$ (alternatively $M'_1 = (\alpha,\alpha)\,M'_2$)

where $M'_1$ consists of the doublets ending in a strong base (NS), and $M'_2$
of the doublets ending in a weak base (NW). Then

$M' = M'_1 \cup M'_2$ can be expressed as
$$M' = [(1,1) \cup (1,\beta)]\,[(1,1) \cup (\alpha,\alpha)]\,M'_0$$

Notice that the operator acting on $M'_0$ has the same form as the operator
acting on $M_0$ above, except that $\beta$ acts on the third base instead of the first.

The sets $M'_1$ and $M'_2$ are invariant under the $(1, \beta)$-transformations. Then
experience shows that the 32 codons in the class with $B_2B_3$ in $M'_1$ or
$M'_2$ constitute a complete code, coding for the 20 amino acids and the
terminator signal (stop codon), if allowance is made for deviating codon
assignments found in mitochondria. For the codons in $M'_1$ this is true in the
universal code; for codons in $M'_2$, AUA should code for M instead of I and UGA
for W instead of the stop signal. Both changes have been observed in
mitochondria. This more symmetric code has been considered more similar to an
archetypal code than the universal code. Only after the last attribute was
introduced was the universal code obtained, with the split of AUR into AUA (I)
and AUG (M), and of UGR into UGG (W) and UGA (terminator). It has been
speculated that primordial genes could be included in a 0.55-kb open reading
frame. The same authors calculated that with two stop codons such open reading
frames would have appeared too frequently. From the present view the assignment
of UGA to a stop codon was a late event that optimized this frequency (this
interpretation differs from earlier proposals, which assume a primordial code
with three stop codons). Other deviations from the universal code most likely
also occurred in the last stages of the code's evolution.
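The completeness statement can be verified against the standard codon table. The sketch below builds the universal code from the conventional UCAG ordering, applies the two mitochondrial reassignments mentioned in the text, and collects the meanings of the 32 codons ending in a strong base (third base C or G) and of the 32 ending in a weak base (third base A or U). The names `UNIVERSAL`, `MITO`, `meanings` and `FULL` are hypothetical.

```python
from itertools import product

# Standard genetic code in the conventional UCAG order ('*' marks a stop codon).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
UNIVERSAL = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# The two mitochondrial reassignments mentioned in the text.
MITO = dict(UNIVERSAL, AUA="M", UGA="W")

# The 20 amino acids plus the stop signal.
FULL = set("ACDEFGHIKLMNPQRSTVWY*")


def meanings(code: dict, third_bases: str) -> set:
    """Amino acids (and stop) encoded by codons whose third base is in third_bases."""
    return {aa for codon, aa in code.items() if codon[2] in third_bases}


if __name__ == "__main__":
    print(sorted(FULL - meanings(UNIVERSAL, "CG")))   # missing from the NNS codons
    print(sorted(FULL - meanings(UNIVERSAL, "AU")))   # missing from the NNW codons
    print(sorted(FULL - meanings(MITO, "AU")))        # after the reassignments
```

In the universal code only M and W are absent from the codons ending in a weak base, which is exactly what the two reassignments supply.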

In the same way as before the sixteen $B_2B_3$ doublets can be represented as
the vertices of a four-dimensional hypercube (Fig. 6). The sets $M'_1$ and $M'_2$
are also located in compact regions. Codons with $B_2B_3$ in $M'_1$ are frequently
used in eukaryotes. In contrast, codons with $B_2B_3$ in $M'_2$ are frequently used
in prokaryotes. The described structure of the code allows a modulation of the
codon-anticodon interaction energy.
Fig. 6. The hypercube, corresponding to Fig. 5, of the sets $M'_1$ and $M'_2$. Notice
that in both cases each set is located in a compact region.



5. Results 



Besides the results mentioned in the last section, which refer to codon doublets,
we consider several examples to further illustrate the significance of the
proposed approach. In the first example (Fig. 2) we discuss the alignment
studied by the method of hierarchical analysis of residue conservation of
Livingstone and Barton (Fig. 2 of their paper). In position 11 the following
amino acids appear: R, W, H, G, D, which according to their approach have no
properties in common. In Fig. 2 this cluster of amino acids is shown. By looking
at the Atlas of amino acid properties we see that, from the properties proposed
by Grantham (composition, polarity and volume), apparently the only
requirement for the amino acids at this site is to maintain a certain degree of
polarity. From this observation we may conclude that most probably it is an
external site. Simply by looking at such a diverse set of amino acids one can
hardly realize that they have clustered codons. This clustering facilitates the
occurrence of mutations that in the course of evolution were fixed, in view of
the low physico-chemical requirements at the site.
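The clustering of these codons can be checked computationally. The sketch below is a simplification: it takes all synonymous codons of the amino acids observed at a site and asks whether they form a single connected region under one-bit changes, whereas the figure displays specific codons and links. A one-bit change is taken to be a change of either the chemical type or the H-bond character of one base (A<->G, C<->U, A<->U, C<->G), so that A<->C and G<->U are excluded. The names `ONE_BIT`, `one_bit_neighbors` and `is_connected` are hypothetical.

```python
from itertools import product

# Standard genetic code in the conventional UCAG order ('*' marks a stop codon).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Bases reachable by changing exactly one of the two attributes.
ONE_BIT = {"A": "GU", "G": "AC", "U": "CA", "C": "UG"}


def one_bit_neighbors(codon: str) -> list:
    """The six codons that differ from `codon` in a single attribute."""
    return [codon[:i] + b + codon[i + 1:]
            for i in range(3) for b in ONE_BIT[codon[i]]]


def is_connected(amino_acids: str) -> bool:
    """True if all codons of the given amino acids form one one-bit cluster."""
    nodes = {codon for codon, aa in CODE.items() if aa in amino_acids}
    if not nodes:
        return False
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:                       # depth-first search within the codon set
        for n in one_bit_neighbors(stack.pop()):
            if n in nodes and n not in seen:
                seen.add(n)
                stack.append(n)
    return seen == nodes


if __name__ == "__main__":
    print(is_connected("RWHGD"))   # amino acids observed at position 11
    print(is_connected("WM"))      # UGG and AUG are not one-bit neighbors
```

Under these assumptions the codons of R, W, H, G and D form one connected region, despite the diversity of the amino acids themselves.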

As a second example (Fig. 3) we consider site 33 of the alignment of 67 SH2
domains (Fig. 6 of the cited work). We can see from Fig. 3 that the cluster
around the codon CAC (H) explains, by one-bit changes, the amino acids R, Q, L,
H, D. Furthermore, a second cluster around the codon AGC (S) explains the
amino acids R, N, S, T. Finally, a silent change from AGC (S) to UCC (S)
accounts for the minor appearance of the small neutral amino acids A, T, P.
In a similar way, the variation of the hypervariable region of immunoglobulin
kappa light chain FR1 at position 18 can be explained (Fig. 7). Finally, by looking
at the residue frequencies in 226 globins displayed in Table 3 of the paper
Fig. 7. The amino acid hypercube with the amino acids at position 18 of the variable
region of kappa light chain displayed. The number after the amino acid symbol is
the number of times the amino acid occurs in the alignment of Kabat et al. (1991):
Sequences of proteins of immunological interest, 5th ed. NIH, Bethesda, MD.



by Bashford et al., it is seen that there are variable positions in which
one or two residues predominantly occur and the rest are only marginally
represented, and others in which the frequencies are more evenly distributed
among the amino acids present. As can easily be shown, the first class of
positions may be associated at the codon level with one (or two) attractor
node(s) and its one-bit neighbors, and the second one with closed trajectories
in the hypercube. The corresponding figures are not included because of lack
of space.
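The notion of an attractor node and its one-bit neighbors can be made concrete with a short sketch. It lists the amino acids encoded by the six one-bit neighbors of a given codon, using the standard codon table and taking a one-bit change to be a change of either the chemical type or the H-bond character of one base. The choice of the histidine codon CAC and the serine codon AGC as attractor nodes is an assumption of this sketch, as are the names `ONE_BIT` and `neighbor_amino_acids`.

```python
from itertools import product

# Standard genetic code in the conventional UCAG order ('*' marks a stop codon).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Bases reachable by changing exactly one of the two attributes.
ONE_BIT = {"A": "GU", "G": "AC", "U": "CA", "C": "UG"}


def neighbor_amino_acids(codon: str) -> list:
    """Sorted amino acids encoded by the six one-bit neighbors of `codon`."""
    neighbors = [codon[:i] + b + codon[i + 1:]
                 for i in range(3) for b in ONE_BIT[codon[i]]]
    return sorted({CODE[n] for n in neighbors})


if __name__ == "__main__":
    for attractor in ("CAC", "AGC"):   # assumed attractor nodes
        print(attractor, CODE[attractor], neighbor_amino_acids(attractor))
```

A site of the first class would then show the attractor's amino acid at high frequency and a subset of the listed neighbors at low frequency.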



6. Concluding Remarks 



The present approach goes beyond the usual analyses in terms of single base 
changes because it takes into account the two characters of each base and 
therefore it represents one-bit changes. Besides, the base position within the 
codon is also considered. The fact that single bit mutations occur frequently 
is expected from probabilistic arguments. However, one could not expect, a 
priori, that a cluster of mutations would correspond at the amino acid level 
to a cluster of amino acids fixed by natural selection. We have found that this 
situation presents itself for many positions of homologous protein sequences 
of many different families (results not included). The structure of the code 
facilitates evolution: the variations found at the variable positions of proteins
do not correspond to random jumps at the codon level, but to well-defined
regions of the hypercube.